Creating a corpus of geospatial language

Authors

  • Kristin Stock
  • Robert C Pasley
  • Zoe Gardner
  • Paul Brindley
  • Jeremy Morley
  • Claudia Cialone
Abstract

The description of location using natural language is of interest for a number of research activities in geography, linguistics and cognitive science, including the development of methods for automated interpretation and generation of natural language to ease interaction with geographic information systems, as well as a number of related endeavours. For such research, examples of geospatial language are usually collected from the personal knowledge of researchers, or in small-scale collection activities specific to the project concerned. This paper describes the process used to develop a more generic corpus of geospatial language. While the motivation for development was the authors’ ongoing research into natural language geospatial querying, the corpus also has wider applications across a range of research areas. The paper describes the development and evaluation of four methods for semi-automated harvesting of geospatial language clauses from text to create such a corpus. The most successful methods use a set of geospatial syntactic templates that describe common patterns of grammatical geospatial word categories, combined with extensible lists of members of those categories. The best method provides a maximum precision of 0.66, and is being used in an ongoing programme to harvest content from a range of web sites, followed by manual confirmation of content to create a monitor corpus. The stated aim of maximising the range of language collected (rather than focussing on the most common examples) across a range of English dialects presented challenges for the approach, as the wider range of included language also increased the rate at which non-geospatial text was incorrectly identified as geospatial. The method was also limited in its ability to exclude metaphoric use of spatial language and sporting references. When run against actual web sites, the proportion of automatically harvested clauses that were manually confirmed as geospatial varied widely depending on the web source concerned, ranging from 5% to 78%, with an overall figure of 42% for the first 31 sites harvested.
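
As a rough illustration of the template idea described in the abstract, the sketch below matches clauses against sequences of word categories backed by extensible word lists, and computes precision as the share of harvested clauses that were manually confirmed. The category names, word lists, templates and function names are all hypothetical and chosen for illustration; they are not taken from the paper, whose actual templates and lists are not given here.

```python
import re

# Hypothetical, extensible category word lists (illustrative only).
CATEGORIES = {
    "SPATIAL_RELATION": {"near", "beside", "across", "behind", "along"},
    "DETERMINER": {"the", "a", "an"},
    "GEO_FEATURE": {"river", "bridge", "church", "station", "road"},
}

# Hypothetical syntactic templates: each is a sequence of category names.
TEMPLATES = [
    ("SPATIAL_RELATION", "DETERMINER", "GEO_FEATURE"),
    ("SPATIAL_RELATION", "GEO_FEATURE"),
]


def tokenize(clause):
    """Lowercase the clause and split it into simple word tokens."""
    return re.findall(r"[a-z']+", clause.lower())


def matches_template(tokens, template):
    """Return True if any contiguous token window fits the template."""
    n = len(template)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(tok in CATEGORIES[cat] for tok, cat in zip(window, template)):
            return True
    return False


def is_candidate(clause):
    """Flag a clause as a geospatial candidate if any template matches."""
    tokens = tokenize(clause)
    return any(matches_template(tokens, t) for t in TEMPLATES)


def precision(harvested, confirmed):
    """Fraction of harvested clauses that were manually confirmed."""
    if not harvested:
        return 0.0
    return sum(1 for c in harvested if c in confirmed) / len(harvested)
```

Under these assumptions, `is_candidate("The pub is near the river")` is flagged, while `is_candidate("She was behind the curve")` is not, because "curve" is absent from the feature list. Widening the lists captures more varied language but also admits more false positives, which is the trade-off the abstract reports.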


Similar resources

TESLA: A Tool for Annotating Geospatial Language Corpora

In this paper, we present The gEoSpatial Language Annotator (TESLA)—a tool which supports human annotation of geospatial language corpora. TESLA interfaces with a GIS database for annotating grounded geospatial entities and uses Google Earth for visualization of both entity search results and evolving object and speaker position from GPS tracks. We also discuss a current annotation effort using...

Mining Geospatial Path Data from Natural Language Descriptions

In this paper, we describe the TEGUS system for mining geospatial path data from natural language descriptions. TEGUS uses natural language processing, GIS entity databases, and graph-based path finding to predict lat/lon paths based only on natural language text input. We also report on preliminary results from experiments on a corpus of path descriptions.

Street-Level Geolocation From Natural Language Descriptions

In this article, we describe the TEGUS system for mining geospatial path data from natural language descriptions. TEGUS uses natural language processing and geospatial databases to recover path coordinates from user descriptions of paths at street level. We also describe the PURSUIT Corpus — an annotated corpus of geospatial path descriptions in spoken natural language. PURSUIT includes the spo...

Exploring the Potential of a Mobile Messaging Application for Self-Initiated Language Learning

With the rapid expansion of deploying mobile instant messaging applications such as Telegram for the purpose of language learning, it is quite apparent that language research in this regard is lagging behind the trend. This study addressed the matter by exploring how language learners utilize a Telegram group for the purpose of language learning. In this regard, the activities of a Telegram lan...

The Language of Place: Semantic Value from Geospatial Context

There is a relationship between what we say and where we say it. Word embeddings are usually trained assuming that semantically-similar words occur within the same textual contexts. We investigate the extent to which semantically-similar words occur within the same geospatial contexts. We enrich a corpus of geolocated Twitter posts with physical data derived from Google Places and OpenStreetMap...

Publication date: 2012